EDA: Red Wine Quality

Introduction to the Data Set

The details of the data are as follows:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
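
As a rough illustration of what the summary output above reports for each column, here is a minimal Python sketch that computes the same six figures. It uses only the ten alcohol values visible in the str() output, so it does not reproduce the full-data summary. The helper name six_number_summary is hypothetical.

```python
import statistics

# First ten alcohol values, copied from the str() output above.
# This is only a 10-row sample, not the full 1599-row column.
alcohol_head = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]


def six_number_summary(values):
    """Return (min, Q1, median, mean, Q3, max), the layout R's summary() prints."""
    # method="inclusive" interpolates between order statistics at (n - 1) * p,
    # which is the same rule as R's default quantile type.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, statistics.fmean(values), q3, max(values))


summary = six_number_summary(alcohol_head)
for label, value in zip(("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max."), summary):
    print(f"{label:>8}: {value:.2f}")
```

Run on the whole column, the same function should line up with the alcohol block of the summary above.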

Univariate Plots Section

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The plot above shows that the quality scores are unevenly distributed: most wines are rated 5 or 6.
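
To put a number on how uneven the distribution is, the counts printed above can be turned into shares. This is a small sketch using only those counts; the variable names are illustrative.

```python
# Counts per quality score, copied from the table() output above.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(quality_counts.values())
shares = {score: count / total for score, count in quality_counts.items()}

# Combined share of the two dominant scores.
share_5_or_6 = shares[5] + shares[6]

for score, share in shares.items():
    print(f"quality {score}: {share:.1%}")
print(f"quality 5 or 6: {share_5_or_6:.1%}")
```

Scores 5 and 6 together account for 1319 of the 1599 wines, a little over 82%.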

## 
##    Low_Q Meddle_Q   High_Q 
##      744      638      217
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The pH values of the red wines mostly fall between 2.8 and 3.8, so the wines are acidic.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

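sulphates has a long-tailed distribution, with most values in the interval (0.3, 1.2).

density is approximately normally distributed.

total.sulfur.dioxide has a long-tailed distribution, with most values in the interval (5, 80).

free.sulfur.dioxide has a long-tailed distribution, with most values in the interval (0, 40).

chlorides is roughly normally distributed, with most values in the interval (0.03, 0.14).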

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

fixed.acidity mostly takes values in the interval (5, 12), with a mean of 8.32.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

volatile.acidity mostly takes values in the interval (0.2, 1.0), with a mean of 0.5278.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

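Residual sugar (residual.sugar) mostly lies between 1.9 and 2.6, the first and third quartiles, and has a long-tailed distribution.

Four values occur especially often: 0, 0.02, 0.24 and 0.49.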

Univariate Analysis

What is the structure of your dataset?

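  1. Excluding X, which is only a row index, there are 1599 observations. Each has 11 numeric variables plus one categorical variable for the quality rating, and there are no missing values.
  2. Some features have long tails, such as residual.sugar.
  3. Some features are approximately normally distributed, such as chlorides, whose values are concentrated in the interval (0.03, 0.14).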

What is/are the main feature(s) of interest in your dataset?

alcohol, pH and sulphates.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

For example, residual.sugar, chlorides and density.

Did you create any new variables from existing variables in the dataset?

A new quality-level variable was added. The counts per level are:

  • Low_Q: 744

  • Meddle_Q: 638

  • High_Q: 217
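
The report does not state the cut points behind the three levels. The sketch below assumes scores up to 5 are Low_Q, a score of 6 is Meddle_Q, and scores of 7 or more are High_Q. This is an inference from the counts, not the author's confirmed rule, and the function name quality_level is hypothetical.

```python
from collections import Counter

# Counts per quality score, copied from the table() output earlier in the report.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}


def quality_level(score):
    """Map a quality score to a level (ASSUMED cut points: <=5, 6, >=7)."""
    if score <= 5:
        return "Low_Q"
    if score == 6:
        return "Meddle_Q"  # label spelled as in the report's output
    return "High_Q"


level_counts = Counter()
for score, count in quality_counts.items():
    level_counts[quality_level(score)] += count

print(dict(level_counts))
# → {'Low_Q': 744, 'Meddle_Q': 638, 'High_Q': 217}
```

These cut points reproduce the three reported totals exactly, which makes the assumption plausible.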

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

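Some variables have long tails, so they were transformed accordingly before plotting. This makes the plots clearer and makes it easier to spot the information hidden in them.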
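
The report does not say which transformation was applied. One common choice for long right tails is a log10 scale, which is assumed here purely for illustration. The sketch uses the residual.sugar summary figures printed earlier and compares how far the maximum sits from the median, measured in interquartile ranges, before and after taking logs. The helper tail_ratio is hypothetical.

```python
import math

# residual.sugar summary values, copied from the summary output in the report.
sugar = {"min": 0.9, "q1": 1.9, "median": 2.2, "q3": 2.6, "max": 15.5}


def tail_ratio(stats):
    """Distance from the median to the maximum, in units of the interquartile range."""
    return (stats["max"] - stats["median"]) / (stats["q3"] - stats["q1"])


# Assumed transformation: log10 of every summary value (valid because all are > 0).
sugar_log = {name: math.log10(value) for name, value in sugar.items()}

ratio_raw = tail_ratio(sugar)
ratio_log = tail_ratio(sugar_log)
print(f"raw scale:   max is {ratio_raw:.1f} IQRs above the median")
print(f"log10 scale: max is {ratio_log:.1f} IQRs above the median")
```

The upper tail is much less extreme on the log scale, which is why a log axis tends to make such histograms easier to read.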

Bivariate Plots Section

Reference: "How to read a box plot" / "Introduction to box plots".

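The ggcorr function gives an overall view of the correlations between variables. Based on the correlation coefficients in the plot above, the influences on red wine quality fall into four groups:

  1. alcohol and sulphates: relatively strong positive correlation;
  2. fixed.acidity and citric.acid: weak positive correlation;
  3. volatile.acidity: relatively strong negative correlation;
  4. chlorides and density: weak negative correlation.

Some features are also highly correlated with each other. For example, pH is strongly correlated with fixed.acidity, yet its correlation with quality is shown as 0, which is odd.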
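
For reference, the coefficient shown in each cell of such a correlation plot is typically the Pearson correlation. Below is a minimal sketch of that calculation, applied to the ten fixed.acidity and pH values visible in the str() output. Ten rows are far too few to reproduce the full-data coefficients, so the result only illustrates the sign of the relationship.

```python
import math

# First ten rows, copied from the str() output at the top of the report.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_sample = pearson(fixed_acidity, ph)
print(f"fixed.acidity vs pH on the 10-row sample: r = {r_sample:.2f}")
```

Even on this tiny sample the coefficient is negative: higher fixed acidity goes with lower pH.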

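The plot above clearly shows that alcohol content has a marked effect on quality.

density is clearly negatively related to alcohol content. What does their relationship look like?

As density increases, alcohol content decreases.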

[Figure: Boxplot of volatile.acidity across Red Wine qualities. x-axis: Quality (score between 3 and 9); y-axis: volatile.acidity (acetic acid - g/dm^3)]

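The plot above clearly shows that volatile.acidity has a marked negative effect on quality.

Judging from the plot above, pH does seem to have a slight influence on quality. The computed correlation coefficient is -0.05773139, so it is small but not exactly 0.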

## [1] -0.05773139

## [1] 0.2513971

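Judging from the plot above, sulphates has a noticeable influence on quality. The correlation coefficient is 0.2513971.

citric.acid and volatile.acidity affect red wine quality in opposite directions. How are they related to each other?

citric.acid and volatile.acidity are negatively correlated.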
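
To make these coefficients easier to compare, each can be squared to give the share of variance in quality that a single linear predictor accounts for. The sketch below does this for the two correlations printed above. It also goes the other way for alcohol, using the R-squared of 0.237 that the report's model table lists for m1 (quality ~ alcohol, fitted on the training subset).

```python
import math

# Correlations with quality, copied from the output above.
r_ph = -0.05773139
r_sulphates = 0.2513971

# R-squared of the alcohol-only model m1, copied from the model table in the report.
r2_alcohol_m1 = 0.237

# Squaring a correlation gives the variance share explained by a one-variable linear fit.
r2_ph = r_ph ** 2
r2_sulphates = r_sulphates ** 2

# Conversely, the square root of R-squared gives |r| for the one-variable model.
abs_r_alcohol = math.sqrt(r2_alcohol_m1)

print(f"pH:        r = {r_ph:+.3f}, r^2 = {r2_ph:.4f}")
print(f"sulphates: r = {r_sulphates:+.3f}, r^2 = {r2_sulphates:.4f}")
print(f"alcohol:   |r| = {abs_r_alcohol:.3f}, r^2 = {r2_alcohol_m1:.3f}")
```

On this view pH alone accounts for about 0.3% of the variance in quality, sulphates for about 6%, and alcohol for about 24%.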

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

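  1. pH seems to have a slight influence on quality, but it is very small;

  2. sulphates has a fairly noticeable influence on quality;

  3. citric.acid and volatile.acidity are negatively correlated.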

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

density is clearly negatively related to alcohol content (alcohol): as density increases, alcohol content decreases.

What was the strongest relationship you found?

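Alcohol content (alcohol) has the strongest relationship with red wine quality. On its own it gives an R-squared of 0.237 in model m1 of the regression table below.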

Multivariate Plots Section

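The plot above suggests that, in general, judging a good wine requires looking at density and alcohol together.

The plot above suggests that when both volatile.acidity and alcohol are shown, alcohol content alone is enough to tell whether a wine's quality is high or low.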

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Judging a good wine requires looking at density and alcohol together.

Were there any interesting or surprising interactions between features?

When both volatile.acidity and alcohol are shown, alcohol content alone is enough to tell whether a wine's quality is high or low.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = training_data)
## m2: lm(formula = quality ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = quality ~ alcohol + sulphates + volatile.acidity, 
##     data = training_data)
## m4: lm(formula = quality ~ alcohol + sulphates + volatile.acidity + 
##     chlorides, data = training_data)
## m5: lm(formula = quality ~ alcohol + sulphates + volatile.acidity + 
##     chlorides + pH, data = training_data)
## 
## ==========================================================================================
##                          m1            m2            m3            m4            m5       
## ------------------------------------------------------------------------------------------
##   (Intercept)           1.736***      1.221***      2.421***      2.588***      3.987***  
##                        (0.196)       (0.199)       (0.218)       (0.223)       (0.440)    
##   alcohol               0.373***      0.356***      0.323***      0.308***      0.323***  
##                        (0.019)       (0.018)       (0.018)       (0.018)       (0.019)    
##   sulphates                           1.052***      0.724***      0.880***      0.856***  
##                                      (0.119)       (0.117)       (0.126)       (0.126)    
##   volatile.acidity                                 -1.194***     -1.147***     -1.031***  
##                                                    (0.106)       (0.107)       (0.111)    
##   chlorides                                                      -1.550**      -1.902***  
##                                                                  (0.475)       (0.483)    
##   pH                                                                           -0.476***  
##                                                                                (0.129)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.237         0.281         0.346         0.352         0.358     
##   adj. R-squared        0.236         0.280         0.345         0.349         0.356     
##   sigma                 0.709         0.688         0.656         0.654         0.651     
##   F                   396.692       249.540       224.956       172.647       142.196     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1373.215     -1335.102     -1274.541     -1269.230     -1262.446     
##   Deviance            641.162       604.067       549.487       544.942       539.191     
##   AIC                2752.429      2678.205      2559.082      2550.461      2538.891     
##   BIC                2767.891      2698.820      2584.851      2581.384      2574.968     
##   N                  1279          1279          1279          1279          1279         
## ==========================================================================================
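
The table reports N = 1279 out of 1599 observations, which is consistent with roughly an 80/20 split into training and test data. The report does not show how the split was made, so the following is only a sketch of one way to do it. The 0.8 fraction, the seed and the function name split_indices are all assumptions.

```python
import random

N_ROWS = 1599          # observations in the data set
TRAIN_FRACTION = 0.8   # ASSUMED; inferred from N = 1279 in the table above
SEED = 42              # arbitrary seed so the sketch is reproducible


def split_indices(n_rows, train_fraction, seed):
    """Shuffle row indices and cut them into a training part and a test part."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(n_rows * train_fraction)  # rounds down
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = split_indices(N_ROWS, TRAIN_FRACTION, SEED)
print(len(train_idx), len(test_idx))
# → 1279 320
```

The models would be fitted on the rows in train_idx, with the remaining 320 rows held back for checking predictions.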

The model's prediction errors are centred close to 0, but the results are not the same for wines of different quality.
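
To show how the fitted model turns measurements into a predicted score, the sketch below plugs the first row from the str() output into the m5 coefficients from the table above. The coefficients are rounded to three decimals as printed, and it is not known whether this row was in the training data, so the result is illustrative only. The function name predict_quality is hypothetical.

```python
# m5 coefficients, copied (rounded as printed) from the regression table above.
M5_COEFFICIENTS = {
    "(Intercept)": 3.987,
    "alcohol": 0.323,
    "sulphates": 0.856,
    "volatile.acidity": -1.031,
    "chlorides": -1.902,
    "pH": -0.476,
}

# First observation, copied from the str() output at the top of the report.
first_row = {
    "alcohol": 9.4,
    "sulphates": 0.56,
    "volatile.acidity": 0.7,
    "chlorides": 0.076,
    "pH": 3.51,
    "quality": 5,
}


def predict_quality(row, coefficients):
    """Linear prediction: intercept plus coefficient times value for each predictor."""
    prediction = coefficients["(Intercept)"]
    for name, coefficient in coefficients.items():
        if name != "(Intercept)":
            prediction += coefficient * row[name]
    return prediction


predicted = predict_quality(first_row, M5_COEFFICIENTS)
residual = first_row["quality"] - predicted
print(f"predicted quality: {predicted:.2f}, observed: {first_row['quality']}, residual: {residual:+.2f}")
```

For this row the prediction is roughly 4.97 against an observed score of 5. Because the model is linear and most wines score 5 or 6, predictions tend to cluster around those values, which is one reason errors can be larger for the rare low and high scores.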

Final Plots and Summary

Plot One

Description One

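The wines were divided into three levels according to their scores, and the resulting distribution is shown above. Low-quality wines clearly outnumber high-quality ones (744 versus 217), so a good wine is not easy to come by.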

Plot Two

Description Two

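The plot above clearly shows that alcohol content has a marked effect on quality. So when buying wine, look at the alcohol content first.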

Plot Three

Description Three

Both density and alcohol have a clear influence on red wine quality.

Reflection

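  1. This assignment gave me solid practice with R, especially ggplot, which produces clear plots with little effort. The syntax is very different from Python, but the operations are simpler.
  2. The modelling section shows that groups with few observations have noticeably higher error rates. Improving the model's accuracy is a natural next step.
  3. Analysing larger and less tidy data sets still needs more practice.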